Insert a concise (max 200 word) exectutive summary. It should be a clear, interesting summary of main insights from the report.
Background to report Motor vehicle collisions are a major cause of death and injury in New York (ref). The purpose of this report is to inform stakeholders of the geospatial patterns of motor vehicle collisions in NYC from 2012-2021, whilst providing some information on the initial data analysis. Relevant stakeholders who may benefit from the research presented in this report include the New York Police Department (NYPD), NYC Open Data, NYC Government, motor vehicle manufacturers, car insurance companies, road safety professionals, NYC pedestrians, cyclists and motorists, and the general public.
Assessment of data provenance The Motor Vehicle Collisions crash data (available at: https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95) was sourced from NYC OpenData and was provided by the New York Police Department (NYPD) for public safety purposes. The dataset is classified as free public data, and the NYC OpenData website includes thorough information on attribution, creation date, and the data generation process. The data was collected by police officers, who completed a MV-104AN report for all vehicle collisions in NYC that resulted in human injuries or fatalities, or at least $1000 worth of damage. From 1999-2016, only very basic data was collected by police officers as not all MV-104AN fields were entered. With the introduction of the Finest Online Records Management System (FORMS) in 2016, police officers began entering all MV-104AN fields using an electronic device. Thus, data prior to 2016 may contain fewer collision details compared to data after 2016. An additional limitation to this dataset is the absence of data prior to 2014, which prevents our ability to analyse trends over a longer period. The data is reliable as it was inputted by police officers who were trained in the data collection process. However, potential human error must still be considered. The dataset is regularly updated (daily) and a ‘MVCDataDictionary’ spreadsheet is available which records the revision history of the dataset. No changes have been documented on the spreadsheet thus far.
Domain knowledge NYC is one of the busiest cities in the world with a population of ~8,340,000 across 783.8 km². The city consists of five boroughs including Brooklyn, the Bronx, Manhattan, Queens and Staten Island. The Cross Bronx Expressway in NYC has been identified as the most congested road in the U.S. Other highly congested roads in NYC include sections of the George Washington Bridge, Fifth Avenue, and the Lincoln Tunnel. Previous reports indicated that pedestrians were ten times more likely to be killed than motorists in a crash, and most serious crashes involved private vehicles. In terms of geospatial trends, pedestrian collisions were two-thirds deadlier on major roads compared to smaller streets. Manhattan had four times more pedestrians injured or killed compared to the other boroughs. For ethics and privacy purposes, the Motor Vehicle Collisions dataset does not reveal any confidential information about the individuals involved in the collisions.
Data structure The Motor Vehicle Collisions crash table consists of 1.7 million rows and 29 columns. Each row corresponds to a motor vehicle collision, whereas the columns provide details of the crash including the date, time, location (borough, zip code, latitude, longitude, street names), number of persons/pedestrians/cyclists/motorists killed or injured, contributing factors, and vehicle type.
library(tidyverse) # piping `%>%`, plotting, reading data
library(skimr) # exploratory data summary
library(naniar) # exploratory plots
library(kableExtra) # tables
library(lubridate) # for date variables
library(plotly)
nyc = read.csv("MVC.csv")
#nyc %>% glimpse()
#nyc %>% summary()
cleannyc <- nyc[!(nyc$LONGITUDE == "" | nyc$LATITUDE == "" | nyc$LOCATION == "" | nyc$LATITUDE == 0 | nyc$LONGITUDE == 0),]
#cleannyc %>% glimpse()
#table(is.na(cleannyc))
cleannyc = na.omit(cleannyc)
#vis_miss(cleannyc, warn_large_data = FALSE)
#boxplot(cleannyc[,11:18],cex.axis = 0.6, las = 1, horizontal = TRUE,par(mar= c(5, 10, 4, 2) + 0.1))
As we can see in the boxplot above, there are many outliers especially in the number of motorist injured, and number of persons injured. Upon inspection when there was the number of motorist injured, it occurred at 9/9/2013 and is a Brooklyn Bus Accident which left 43 people injured when a car collided head on with a bus. And it also turns out that this is the same entry for the outlier in number of persons injured. The reasons why it is in both persons and motorist category is because the 43 people are in the bus, therefore classified as motorists. These outliers without inspection may seem extraordinary and perhaps a possibility of being faulty data collection, however with a further glance they seem to be valid and an important part of our data analysis. In fact in comparison to the mean and median of all these columns, most of the circles shown in the graph are considered outliers. Evidently the median and mean are around 0 accidents, which is expected. Because we have so many entries in data, and the probability of being in an accident is relatively small, this graph is exposed to skewedness, and thus we cannnot say that all these data points greater than 0 are outliers.
#max(cleannyc$NUMBER.OF.MOTORIST.INJURED)
#cleannyc %>% filter(NUMBER.OF.MOTORIST.INJURED == 43)
#max(cleannyc$NUMBER.OF.PERSONS.INJURED)
#cleannyc %>% filter(NUMBER.OF.PERSONS.INJURED == 43)
Even though these outliers are valid, they will affect our aggregate data, by dragging the mean higher than it should be. This is why median is much better than using mean, as it is not as affected by high outliers. We do not care about low outliers as the base is 0 and cannot fall lower. We can also take a look into more details and the affect of these two outliers using the graph below.
#fig = plot_ly(y = cleannyc$NUMBER.OF.PERSONS.INJURED, type = "box", name = "Number of Persons Injured")
#fig = fig %>% add_trace(y = cleannyc$NUMBER.OF.MOTORIST.INJURED, name = "Number of Motorists Injured") %>% layout(title = "Persons Injured and Motorist Injured Outlier Analysis")
#fig
Here we explore the frequencies of road accidents happening according to the time of the day. To provide emergency assistance to victims, the NYC Fire Department (FDNY) and (FDNY) Bureau of Emergency Medical Services (EMS) has 340 fire engines and 450 ambulances respectively1.
In the graph below we see the trend of no. of road accidents happening over the course of 2012 to 2019.
as.Date(nyc$CRASH.DATE, tryFormats = c("%m/%d/%Y")) -> nyc$CRASH.DATE
as.data.frame(table(nyc$CRASH.DATE)) -> date_table
plot_ly(date_table, x= ~Var1, y = ~Freq, type = "scatter", mode = "lines", text = paste(date_table$Var1, ", ", date_table$Freq)) -> fig1
fig1 %>% layout(title = "No. of accidents per day in NYC (2012-2019)",
yaxis = list(title = "No. of Accidents (in 1 day)"),
xaxis = list(title = "Date",
type = "date",
range = c("2012-07-01", "2019-12-03"))) -> fig1
fig1
From the above graph we can see that the frequency of road accidents happening per day has been fairly constant over the years. On an average, New York City has 595.8 road accidents per day. But to counter this, the FDNY average response time for vehicle accidents stands at a poor 60 minutes and 27 seconds2 which is 4 times the national average of 15 minutes and 19 seconds3.
According to Mark Cunningham in EMS. One More Time(2015)4, the reason for this poor performance is systemic inefficiency ranging from fewer working hours for EMS drivers to poor algorithm design for assigning operators.
This problem can be addressed by understanding and analyzing the trends in traffic incidences. For instance, it can be observed from the above graph that 25th December is often the least violent day on the road of the year. By exploring more such trends, the FDNY would be better equipped to improve their response time and save numerous lives.
cleannyc %>% separate(CRASH.TIME, c("hour", "min"), ":") %>% separate(CRASH.DATE, c("Month", "Day", "Year"), "/", remove = FALSE) -> cleannyc
as.data.frame(table(cleannyc$hour)) -> time_freq
time_freq[order(time_freq$Var1),] -> time_freq$Var1
time_freq$Var1$Freq <- NULL
cleannyc %>% mutate(
killed = (NUMBER.OF.PERSONS.KILLED +
NUMBER.OF.PEDESTRIANS.KILLED +
NUMBER.OF.CYCLIST.KILLED +
NUMBER.OF.MOTORIST.KILLED
),
injured = (NUMBER.OF.PERSONS.INJURED +
NUMBER.OF.PEDESTRIANS.INJURED +
NUMBER.OF.CYCLIST.INJURED +
NUMBER.OF.MOTORIST.INJURED
)) %>% mutate(
score = (killed * 0.1 + injured*0.025)) -> cleannyc
We first look at the the timings of the accidents in the day. The following graph plots the number of road crash victims against the time of the accident.
i = 0
kills=c()
injuries= c()
while (i<24){
sum(cleannyc[cleannyc$hour == i,]$killed) -> killed_sum
sum(cleannyc[cleannyc$hour == i,]$injured) -> injured_sum
kills = c(kills, killed_sum)
injuries = c(injuries, injured_sum)
i = i+1
}
as.data.frame(cbind(time_freq, kills, injuries)) -> time_score
fig3 = plot_ly(time_score, x = ~Var1, y = ~kills, type = 'scatter', mode = "lines+markers", name = "Deaths")
fig3 %>% add_trace(y = ~injuries, name = "Injuries") -> fig3
fig3 <- fig3 %>% layout(yaxis = list(title = 'Count'), barmode = 'group')
fig3 %>% layout(title = "Hourly deaths and injuries from road accidents",
xaxis = list(title = "Time of the day",
ticktext = list("00:00-01:00", "01:00-02:00", "02:00-03:00","03:00-04:00", "04:00-05:00","05:00-06:00","06:00-07:00", "07:00-08:00", "08:00-09:00", "09:00-10:00", "10:00-11:00","11:00-12:00","12:00-13:00", "13:00-14:00", "14:00-15:00", "15:00-16:00", "16:00-17:00", "17:00-18:00", "18:00-19:00", "19:00-20:00", "20:00-21:00", "21:00-22:00", "22:00-23:00", "23:00-00:00"),
tickvals = list(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23)),
yaxis = list(title = "No. of Cases")) -> fig3
fig3
From the graph, we see a spike in accidents between 8 to 9 am. This is when people go to their workplaces. We can also observe that the highest amount of fatalities are found between the timings 4 pm to 6pm.
This is usually the time when the working class return from their offices. It also coincides with people going out in the evening. Thus, this increases the chances of an accident happening. Since people are usually tired after a day’s work, the chances of driving errors is higher as well.
During these times, the FDNY and emergency medical services should be the most alert. Since New York firemen work in 8-hour long shifts, it should be ensured that the shift change should not happen during these periods.
We now explore which days of the week are the most prone to accidents. The graph below plots the number of accidents against the days of the week.
day = weekdays(as.Date(cleannyc$CRASH.DATE, tryFormats = "%m/%d/%Y"))
cleannyc = cbind(cleannyc, day)
as.data.frame(table(cleannyc$day)) -> day_table
factor(day_table$Var1, levels= c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")) -> day_table$Var1
fig4 = plot_ly(day_table, x = ~Var1, y = ~Freq, type = "bar")
fig4 %>% layout(title = "Weekly trends in Road Accidents (2012-2019)",
yaxis = list(title ="No. of Accidents"),
xaxis = list(title = "Day of the Week")) -> fig4
fig4
Here we observe that Friday has the highest amount of road accidents happening and the weekends: Saturday and Sunday have significantly fewer accidents compared to weekdays.
Fridays tend to be crowded on the roads as most people go out during Friday evenings before the weekend.
Apart from Friday, the no. of accidents occuring is around the same during the weekdays. The weekends are safer since most workplaces are closed and thus, fewer people go out during the day.
The emergency services dept. should consider having longer shifts for emergency personnel on the days with higher traffic.
Apart from this, New York City Fire Department should consider investing into hiring more personnel and increasing emergency stations around the city to reduce the response time. A limitation of this dataset was that it does not contain independent contractual service providers which are common in New York due to its inefficient emergency services. These providers are contractually obligated to provide service in a given timeframe. Hence, they perform much better than the FDNY.
By incorporating the learnings from the above observations, the FDNY can aim to become a competent and trustable emergency service provider.
Insert your reflection on how data wrangling helped you explore your research questions. (Don’t forget to adjust information at the top of report regarding your name in the author field etc!!)